A Corpus for Analyzing Text Reuse by People of Different Groups
Abstract
Plagiarism, the unattributed reuse of text, is a significant problem, particularly for higher education institutions. Consequently, a number of automated plagiarism detection systems have been developed to address it. Comparing these systems is difficult because real cases of plagiarism by students and scholars are hard to collect. This paper describes the development of a corpus containing simulated cases of plagiarism by people with different levels of writing skill. The corpus will be a valuable addition to the set of evaluation resources currently available for comparing plagiarism detection systems.
Similar Resources
The Short Stories Corpus: Notebook for PAN at CLEF 2015
In this work we describe the construction of a plagiarism detection/text reuse corpus submitted for the PAN-2015 Evaluation Lab. Our corpus consists of four different text reuse scenarios, namely (1) no-plagiarism, (2) story-retelling, (3) synonym-replacement and (4) character-substitution. Among these scenarios, the most interesting one is story retelling; through it we find patterns of textual ...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
COUNTER: corpus of Urdu news text reuse
Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavail...
Applying BLAST to Text Reuse Detection
We present the results of text reuse detection based on a corpus of scanned and OCR-recognized Finnish newspapers and journals from 1771 to 1910. Our study draws on BLAST, software created for comparing and aligning biological sequences. We show different types of text reuse in this corpus, and also present a comparison to the software Passim, developed at Northeastern University in Bo...
A'lam Corpus: A Standard Named Entity Corpus for the Persian Language
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...